Data and R
Data Wrangling & DataViz

Session 2 | 19/08/2024
Henrique Costa | Métodos Estratégicos em FinQuant

Data and R

Functions

  • Recipes let chefs prepare tasty treats
    • Recipes call for ingredients
    • Recipes involve one or more steps
    • The steps transform the ingredients into treats
  • Functions are like customizable recipes
    • Functions ask for inputs (“arguments”)
    • Functions involve one or more lines of code
    • The code transforms inputs into outputs
    • Using a function (usually) requires parentheses

doce <- f(ing1, ing2)

Functions - Practice

# USECASE: A function can perform a task in an easier, more readable way

# TEMPLATE: output <- function_name(input)

9 ^ (1 / 2)

x <- sqrt(9)
x

# ==============================================================================

# LESSON: We can also use functions to transform objects

y <- 9

sqrt(y)

# ==============================================================================

# LESSON: We can even use functions to transform the results of calculations

2 / 3

round(2 / 3)

# ==============================================================================

# LESSON: We can customize what a function does using arguments

# TEMPLATE: output <- function_name(argument, argument_name = argument_value)


round(2 / 3, digits = 2)

round(2 / 3, digits = 3)

# ==============================================================================

# LESSON: Some arguments are optional because they have default values

round(2 / 3) # the default value for digits is 0

round(2 / 3, digits = 0)

Vectors

  • Vectors combine similar objects into a collection
    • I like to picture a train pulling several cars
    • A vector is one object with many sub-objects
    • We refer to each sub-object as an element
  • Some functions transform each element one at a time
    • Double the amount of cargo in each car
  • Some functions summarize across all elements
    • Calculate the total cargo across all cars in the train


v <- c(1, 2, 3, 4, 5)

Vectors - Practice

# LESSON: We can combine several elements into a vector

# TEMPLATE: vector_name <- c(element1, element2, element3)

x <- 4 9 16 25 # error

x <- c(4, 9, 16, 25)
x

y <- c(2, 3)
y

# ==============================================================================

# LESSON: We can also combine several vectors and elements

c(x, y)

c(x, y, 20)

# ==============================================================================

# USECASE: Math operators will transform each element individually

x + 1

x * 3

x # but, again, this won't be saved unless you use assignment

# ==============================================================================

# USECASE: Some functions will also transform each element individually

sqrt(x)

log(x)

# ==============================================================================

# USECASE: Other functions will summarize the vector with a single number

length(x)

sum(x)

mean(x)

Strings

  • When programming in R, we need a way to distinguish
    • Names of objects/functions (e.g., the function mean)
    • Text/character data (e.g., the word mean)
  • Strings are R's way of storing text data
    • Strings can store any character (no rules!)
    • Strings are created and displayed with quotation marks


Strings

  • R has great tools for working with strings
    • Strings can be collected into vectors
    • Special functions can transform strings

name <- "John Doe"


Strings - Practice

# USECASE: Strings are the main way to store character data in R
 
my_color <- red # error

my_color <- "red" # correct

# ==============================================================================

# USECASE: Strings can also store symbols that are not allowed in object names

dye <- "red#40"
dye

dyes <- c("red#40", "blue#02")
dyes

# ==============================================================================

# PITFALL: Many operations that work on numbers will not work on strings

dyes + 1 # error

mean(dyes) # warning: returns NA

# ==============================================================================

# USECASE: But other operations work for both, or even only for strings

length(dyes)

nchar(dyes)

dyes2 <- toupper(dyes)
dyes2

Packages

  • Cookbooks are a great way to learn to cook
    • They contain many recipes and instructions
    • Browse an online bookstore to find a cookbook
    • Order it to add it to your personal bookshelf
    • To use it, take the cookbook off the shelf

Packages

  • Packages are like cookbooks for R
    • They contain useful functions and datasets
    • Browse an online repository to find a package
    • Install it to add it to your personal library
    • To use it, load the package from the library

library("pkg_name")

Packages - Practice

# USECASE: The stringr package adds a function to fix capitalization

students <- c("mary anne", "BENjamin", "Lee")

# ==============================================================================

# PITFALL: But we cannot use this function without installing the package

str_to_title(students) # error

# ==============================================================================

# LESSON: Installing a package using RStudio

# - RStudio > Extras pane > Packages tab > Install button

# ==============================================================================

# PITFALL: We also need to load the package before we can use it

str_to_title(students) # error

# ==============================================================================

# LESSON: We load the package using library()

library("stringr")
str_to_title(students) # now it works!

# ==============================================================================

# LESSON: We can also keep our packages up to date using RStudio

# RStudio > Extras pane > Packages tab > Update button

Wrangle I

Tidy Data Principle

  • There are many ways to store data
  • We will learn the tidy data format
    • The data should be rectangular
    • Each variable has its own column
    • Each observation has its own row
    • Each value has its own cell

Other data advice

  • Name every variable in the first row
    • This is called the header row
  • Avoid merged cells for data storage
    • These are fine for communication
  • Avoid empty cells whenever possible
    • Mark missing data as NA
  • Avoid formatting as data for storage
    • e.g., non-redundant color coding

Tidying example 1

Untidy

Name    Ann   Bob   Cat   Dom
Age     13    10    11    11
Weight  56.4  46.8  41.3  43.3

❌ Here, each row is a variable and each column is an observation.

Tidy

Name  Age  Weight
Ann   13   56.4
Bob   10   46.8
Cat   11   41.3
Dom   11   43.3

✔️ Here, each column is a variable and each row is an observation.

Tidying example 2

Untidy

Name: Ann Bob Cat Dom
Age  Weight
13   56.4
10   46.8
11   41.3
11   43.3

❌ Here, the data are not rectangular because the Name variable has its own row.

Tidy

Name  Age  Weight
Ann   13   56.4
Bob   10   46.8
Cat   11   41.3
Dom   11   43.3

✔️ Here, we made the data rectangular by moving the Name variable into its own column.

Tidying example 3

Untidy

country      year  cases / population
Afghanistan  1999  NA / 19987071
             2000  2666 / 20595360
Brazil       1999  37737 / 172006362
             2000  80488 / 174504898
China        1999  212258 / 1272915272
             2000  213766 / 1280428583

❌ Here, we have merged cells and two values stored in a single cell.

Tidy

country      year  cases   population
Afghanistan  1999  NA      19987071
Afghanistan  2000  2666    20595360
Brazil       1999  37737   172006362
Brazil       2000  80488   174504898
China        1999  212258  1272915272
China        2000  213766  1280428583

✔️ Here, we unmerged the country cells and separated the cases and population variables into their own columns.
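
As a minimal sketch, the untidy table above could be repaired in R roughly as follows. It assumes the merged country cells were imported as NA and that the combined column arrived as text under the hypothetical name `rate`. Both `fill()` and `separate()` come from tidyr, which is loaded with tidyverse (recent tidyr versions mark `separate()` as superseded, but it still works).

```r
library(tidyr)
library(tibble)

# Hypothetical import of the untidy table: merged cells became NA and
# the two values per cell arrived as a single string
untidy <- tibble(
  country = c("Afghanistan", NA, "Brazil", NA, "China", NA),
  year = c(1999, 2000, 1999, 2000, 1999, 2000),
  rate = c("NA / 19987071", "2666 / 20595360",
           "37737 / 172006362", "80488 / 174504898",
           "212258 / 1272915272", "213766 / 1280428583")
)

# Unmerge: carry each country name down into the empty cells below it
filled <- fill(untidy, country)

# Split the combined column in two and convert the text to numbers
tidy <- separate(filled, rate, into = c("cases", "population"),
                 sep = " / ", convert = TRUE)
tidy
```

With `convert = TRUE`, the text "NA" becomes a true missing value, so the first row keeps its missing case count.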

Tidying example 4

Untidy

student   grade
Amber     91.5   A-
Bristol   86.2   B
Charlene  94.0   A
Diego     89.3   B+
Legend: Psych. Major, Psych. Minor

❌ Here, we have a missing variable name and formatting used as data.

Tidy

student   psych  grade  letter
Amber     major  91.5   A-
Bristol   minor  86.2   B
Charlene  major  94.0   A
Diego     NA     89.3   B+

✔️ Here, we added a column for the psych variable, removed the legend, and named the letter variable.

Tidying example 5

Untidy

student   grade  letter
Amber     91.5   A-
Bristol*  94.2   A
Class Summary
As        2      Yay!
Bs        0
*Grade was revised.

❌ Here, we have two kinds of data in one file and a footnote used as data.

Tidy

student  grade  letter  revised
Amber    91.5   A-      FALSE
Bristol  94.2   A       TRUE

letter  count  notes
A       2      Yay!
B       0

✔️ Here, we split the data into two separate tables and added the revised and notes variables.

Tibbles

  • R works particularly well with tidy data
  • We store tidy data in data frames or tibbles
    • Tibbles are just fancier data frames
      (i.e., they have some extra features)
  • To use tibbles, we need the tidyverse package
  • Tibbles are built from one or more vectors
    • The vectors must have the same length
    • They can contain different types of data

Vectors

We start with three separate vector objects that all have the same length.

We set them up so that the nth car in each train corresponds to the same observation.

Tibble

Then we combine the vectors into a single tibble (or data frame) object.

Now, as the tibble moves, the variables always stay together.

Tibbles - Practice

# SETUP: Install and load the tidyverse package

# Extras pane > Packages tab > Install

library(tidyverse)

# ==============================================================================

# LESSON: Create a tibble from vectors

x <- c(10, 20, 30, 40)
x

y <- x * 2 - 4
y

my_tibble <- tibble(x, y)
my_tibble

# ==============================================================================

# USECASE: You can mix different types of vectors in a single tibble

first_names <- c("Adam", "Billy", "Caitlyn", "Debra")

age_years <- c(12, 13, 10, NA)

guests <- tibble(first_names, age_years)
guests

# ==============================================================================

# TIP: To save time, you can also create the vectors inside the tibble() call

gradebook <- tibble(
  grade = c(95, 83, 90, 76),
  letter = c("a", "b", "a-", "c")
)
gradebook

# ==============================================================================

# PITFALL: Don't try to combine vectors with different lengths

y <- c(1, 2, 3)
x <- c("a", "b")

tibble(y, x) #error

# ==============================================================================

# LESSON: The exception is that R will "recycle" a single value

tibble(y, x = "a")

# ==============================================================================

# LESSON: You can "extract" a vector from a tibble using $

mytibble <- tibble(x = c(1, 2, 3, 4, 5), y = "test")

mytibble$x

mytibble$y

# ==============================================================================

# PITFALL: Don't try to extract a vector that doesn't exist

mytibble$z # warning: unknown column, returns NULL

Long and wide tables
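
As a small sketch of the long versus wide distinction: the same measurements can be stored in wide form (one column per year) or in long form (one row per country-year pair). The `wide` tibble below is an illustrative assumption that reuses case counts from tidying example 3. `pivot_longer()` and `pivot_wider()` are tidyr functions loaded with tidyverse.

```r
library(tidyr)
library(tibble)

# Illustrative data in wide form: one row per country, one column per year
wide <- tibble(
  country = c("Brazil", "China"),
  `1999` = c(37737, 212258),
  `2000` = c(80488, 213766)
)

# Wide to long: one row per country-year pair
long <- pivot_longer(wide, cols = c(`1999`, `2000`),
                     names_to = "year", values_to = "cases")
long

# Long back to wide: one column per year again
wide_again <- pivot_wider(long, names_from = year, values_from = cases)
wide_again
```

The long form is the tidy one here, because year is a variable and so gets its own column.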


Import and Export

  • Data are usually stored in data files
    • Importing files into R is called reading
    • Exporting files from R is called writing
  • A convenient type of data file is the CSV
    • This stands for comma-separated values
    • A CSV file is easy to share with other people
  • The tidyverse package can read/write CSVs
    • Other packages can read/write other types (e.g., readxl, haven, rio, googlesheets4)

Read/Write - Practice

# SETUP: Load the tidyverse package (if you haven't already)

library(tidyverse)

# ==============================================================================

# USECASE: Create a tibble and write it to a file

gradebook <- tibble(
  id = c(123, 456, 789),
  grade = c("A", "B", "A")
)
gradebook

write_csv(gradebook, file = "gradebook.csv")

# NOTE: You can see the new file in the Extras pane > Files tab.
# You can open the file in another program (e.g., Microsoft Excel).
# You can also email this file to someone else to share it.

# ==============================================================================

# PITFALL: Don't swap the order of the tibble and the file

write_csv("gradebook.csv", gradebook) # error

# ==============================================================================

# USECASE: Read in a file containing data

old_gradebook <- read_csv("gradebook.csv")
old_gradebook

# NOTE: read_csv() will examine each variable and guess its data type.
# You can tell it the data type of each variable, but that is more advanced.
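
For the curious, here is a minimal sketch of telling read_csv() the type of each variable through its `col_types` argument. It assumes the gradebook columns from this section and writes to a temporary file (the names `tmp` and `typed_gradebook` are illustrative only).

```r
library(readr)
library(tibble)

gradebook <- tibble(
  id = c(123, 456, 789),
  grade = c("A", "B", "A")
)

# Write to a temporary file so the example leaves nothing behind
tmp <- tempfile(fileext = ".csv")
write_csv(gradebook, file = tmp)

# Read id as text (it is a label, not a quantity) instead of letting
# read_csv() guess that it is a number
typed_gradebook <- read_csv(
  tmp,
  col_types = cols(
    id = col_character(),
    grade = col_character()
  )
)
typed_gradebook
```

Declaring the types up front is one way to keep identifiers from being treated as numbers.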

# ==============================================================================

# PITFALL: Don't use the read.csv() and write.csv() functions

old_gradebook <- read.csv("gradebook.csv") # not a tibble
old_gradebook

Wrangle II

Basic wrangling verbs

  • tidyverse provides tools for wrangling tibbles
    • These functions are named after verbs
    • So if you name your objects after nouns
    • …your code becomes easier to read

Noun(noun) ❌         Verb(noun) ✔️
blender(fruit)       blend(fruit)
screwdriver(screw)   drive(screw)
boxcutter(box)       cut(box)

Column-focused verbs

  • Select retains only certain columns/variables
    • select(TBL, VAR_KEEP, -VAR_DROP)
  • Mutate adds or transforms columns/variables
    • mutate(TBL, NEW_VAR = OLD_VAR / 1000)
  • Rename changes the names of columns/variables
    • rename(TBL, NEW_NAME = OLD_NAME)
  • Relocate changes the order of columns/variables
    • relocate(TBL, VAR_MOVE, .after = OTHER_VAR)

Select Live Coding

# SETUP: Load package and inspect example tibble

library(tidyverse) # includes the dplyr package
starwars

# ==============================================================================

# USECASE: Retain only the specified variables

sw <- select(starwars, name)
sw
sw <- select(starwars, name, sex, species)
sw

# ==============================================================================

# PITFALL: Don't forget to save the change with assignment

select(starwars, name, sex, species)
starwars # still includes all variables

# ==============================================================================

# USECASE: Retain all variables between two variables

sw <- select(starwars, name, hair_color:eye_color)
sw

# ==============================================================================

# USECASE: Retain all variables except the specified ones

sw <- select(starwars, -sex, -species)
sw
sw <- select(starwars, -c(sex, species))
sw
sw <- select(starwars, -c(hair_color:starships))
sw

Mutate Live Coding

# SETUP: Create example tibble

patients <- tibble(
  id = c("S1", "S2", "S3"),
  feet = c(6, 5, 5),
  inches = c(1, 7, 2),
  pounds = c(176.3, 124.9, 162.6)
)
patients

# ==============================================================================

# USECASE: Add one or more variables

p2 <- mutate(patients, sex = c("M", "F", "F"))
p2

ages <- c(32, 41, 29)
p2 <- mutate(patients, ages = ages)
p2

p2 <- mutate(
  patients, 
  sex = c("M", "F", "F"), 
  ages = ages
)
p2

# ==============================================================================

# USECASE: Compute variables

p2 <- mutate(patients, height = feet + inches / 12)
p2

p2 <- mutate(patients, ln_pounds = log(pounds))
p2

# ==============================================================================

# USECASE: Overwrite variables

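patients <- mutate(patients, height = feet + inches / 12) # height in feet
patients <- mutate(patients, height = height / 3.281) # overwritten: now in meters
patients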

# ==============================================================================

# USECASE: Duplicate variables

p2 <- mutate(patients, weight = pounds)
p2 # both weight and pounds exist

Rename / Relocate Live Coding

# USECASE: Change the name of one or more variables

starwars

sw <- rename(starwars, Character = name)
sw

sw <- rename(starwars, height_cm = height, mass_kg = mass)
sw

# ==============================================================================

# PITFALL: Don't swap the order and try old_name = new_name

sw <- rename(starwars, name = Character) # error

# ==============================================================================

# USECASE: Move variables before or after another variable

starwars

sw <- relocate(starwars, species, sex, .before = height)
sw

sw <- relocate(starwars, species, sex, .after = name)
sw

# ==============================================================================

# PITFALL: Don't forget the period!

sw <- relocate(starwars, sex, before = height) 
sw # height was accidentally renamed to before

Row-focused verbs

  • Arrange sorts rows based on their values
    • arrange(TBL, VAR_SORT_UP)
    • arrange(TBL, desc(VAR_SORT_DOWN))
    • arrange(TBL, VAR_SORT_1ST, VAR_SORT_2ND)
  • Filter retains certain rows based on criteria
    • filter(TBL, DBL_CRIT > 0)
    • filter(TBL, STR_CRIT == "A")
    • filter(TBL, CRIT1, CRIT2)

Arrange Live Coding

# USECASE: Sort observations by a variable

starwars

sw <- arrange(starwars, height)
sw # sorted by height, ascending

sw <- arrange(starwars, name)
sw # sorted by name, alphabetically

# ==============================================================================

# USECASE: Sort observations by a variable, in reverse order

sw <- arrange(starwars, desc(height))
sw # sorted by height, descending

sw <- arrange(starwars, desc(name))
sw # sorted by name, reverse-alphabetically

# ==============================================================================

# USECASE: Sort observations by multiple variables

sw <- arrange(starwars, hair_color, mass)
sw # sorted by hair_color, then ties broken by mass

Filter Live Coding

# USECASE: Retain only observations that meet a criterion

sw <- filter(starwars, mass > 100)
sw # only observations with mass greater than 100

sw <- filter(starwars, mass <= 100)
sw # only observations with mass less than or equal to 100

sw <- filter(starwars, species == "Human")
sw # only observations with species equal to Human

sw <- filter(starwars, species != "Human")
sw # only observations with species not equal to Human

# ==============================================================================

# PITFALL: Don't try to use a single = for testing equality

sw <- filter(starwars, height = 150) # error

sw <- filter(starwars, height == 150) # correct
sw 

# ==============================================================================

# PITFALL: Don't forget that R is case-sensitive

sw <- filter(starwars, species == "human")
sw # no observations left (because it should be Human)

# ==============================================================================

# USECASE: Retain only observations that meet complex criteria

sw <- filter(starwars, mass > 100 & height > 200)
sw # only observations with mass over 100 AND height over 200

sw <- filter(starwars, height < 100 | hair_color == "none")
sw # only observations with height under 100 OR hair_color equal to none

# ==============================================================================

# PITFALL: Don't forget to complete both conditions

sw <- filter(starwars, mass > 100 & < 200) # error

sw <- filter(starwars, mass > 100 & mass < 200) # correct
sw

# ==============================================================================

# PITFALL: Don't try to equate a string to a vector

sw <- filter(starwars, species == c("Human", "Droid")) # wrong: recycles the vector instead of testing membership

sw <- filter(starwars, species %in% c("Human", "Droid")) # correct
sw

Filter Cheatsheet

Symbol  Description            Num  Chr
<       Less than              Yes  No
<=      Less than or equal to  Yes  No
>       More than              Yes  No
>=      More than or equal to  Yes  No
==      Equal to               Yes  Yes
!=      Not equal to           Yes  Yes
%in%    Found in               Yes  Yes
&       Logical And            Yes  Yes
|       Logical Or             Yes  Yes
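
The template filter(TBL, CRIT1, CRIT2) from the row-focused verbs slide separates criteria with commas. As a quick sketch using the starwars tibble (object names are illustrative), commas behave like &: every criterion must hold. Rows where a criterion evaluates to NA are dropped.

```r
library(dplyr)

# Two criteria separated by a comma...
big_comma <- filter(starwars, mass > 100, height > 200)

# ...keep the same rows as joining them with &
big_and <- filter(starwars, mass > 100 & height > 200)

identical(big_comma$name, big_and$name)

# Rows with a missing mass are dropped, because NA > 100 is NA, not TRUE
sum(is.na(starwars$mass))
sum(is.na(big_comma$mass))
```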

Wrangle III

Pipes & Pipelines

  • How can we do multiple operations to an object?
    1. x <- 10
    2. x2 <- sqrt(x)
    3. x3 <- round(x2)
  • This works but is cumbersome and error-prone
  • A better approach is to use pipes and pipelines
    • x3 <- 10 |> sqrt() |> round()
  • I like to read |> as “and then…”
    • “Take 10 and then sqrt() and then round()”

Pipes Live Coding

# SETUP: Enable the pipe operator shortcut

# Tools > Global Options... > Code tab > Check "Use Native Pipe Operator"

# Type out |> or press Ctrl+Shift+M (Windows) / Cmd+Shift+M (Mac)

# ==============================================================================

# LESSON: The pipe pushes objects to a function as its first argument

# TEMPLATE: x |> function_name() is the same as function_name(x)

x <- 10

y <- sqrt(x)
y

y <- x |> sqrt()
y

# ==============================================================================

# PITFALL: Don't forget to remove the object from the function call

x |> sqrt(x) # wrong

x |> sqrt() # correct

# ==============================================================================

# USECASE: You can still use arguments when piping

z <- round(3.14, digits = 1)
z

z <- 3.14 |> round(digits = 1)
z

# ==============================================================================

# USECASE: Pipes are useful with tibbles and wrangling verbs

starwars

sw <- select(starwars, name, species, height)
sw

sw <- starwars |> select(name, species, height)
sw

# ==============================================================================

# PITFALL: Don't add a pipe without a step after it

sw <- starwars |> select(name, species, height) |> # error

Pipelines Live Coding

# USECASE: You can chain multiple pipes together to make a pipeline

x <- 10 |> sqrt() |> round()
x

# ==============================================================================

# TIP: If you want to see the output of a pipeline, you can pipe to print()

x <- 10 |> sqrt() |> round() |> print()

# ==============================================================================

# TIP: To make your pipelines more readable, move each step to a new line

x <- 
  10 |> 
  sqrt() |> 
  round() |>
  print()

# ==============================================================================

# PITFALL: Don't put the pipe at the beginning of a line, though

x <- 
  10 
  |> sqrt()
  |> round()
  |> print() # error

# ==============================================================================

# USECASE: Chain together a series of verbs to flexibly wrangle data

tallones <- 
  starwars |> 
  select(name, species, height) |> 
  rename(height_cm = height) |> 
  mutate(height_ft = height_cm / 30.48) |>  
  filter(height_ft > 7) |> 
  arrange(desc(height_ft)) |>  
  print()

Factors

  • Factors are used to represent categorical data
    • Factors have multiple possible levels
    • Levels are discrete and mutually-exclusive
  • Sometimes categories are unordered (nominal)
    • Action or Comedy or Drama
    • Asia or Europe or North America
  • Sometimes categories are ordered (ordinal)
    • Mild < Medium < Hot
    • XS < S < M < L < XL

Factors Live Coding

# USECASE: Ask 10 kids to order 1: nuggets, 2: pizza, or 3: salad

food <- c(2, 2, 1, 2, 1, 2, 1, 1, 2, 2)
food

# ==============================================================================

# LESSON: We can turn this vector into a factor with factor()

food2 <- factor(food, levels = c(1, 2, 3))
food2

food3 <- factor(food, levels = c(1, 2, 3),
                labels = c("nuggets", "pizza", "salad"))
food3

# ==============================================================================

# USECASE: We can also quickly and easily count each level with table()

table(food3)

# ==============================================================================

# PITFALL: Don't confuse levels and labels

food4 <- factor(food, labels = c(1, 2, 3),
                levels = c("nuggets", "pizza", "salad"))
food4 # full of <NA> because it can't find these levels

# ==============================================================================

# USECASE: You can also just enter strings directly (as self-labels)

genre <- c("pop", "metal", "pop", "rock", "rap", "rap", "pop", "rock")
genre

genre2 <- factor(genre) # observed levels will be assigned alphabetically
genre2

table(genre2)

# ==============================================================================

# LESSON: If ordinal, enter levels low-to-high and add ordered = TRUE

salsa <- c("hot", "mild", "medium", "mild", "medium", "medium")

salsa2 <- factor(salsa, 
                 levels = c("mild", "medium", "hot"), 
                 ordered = TRUE)
salsa2 

# NOTE: We may want to visualize or model ordinal factors differently

# ==============================================================================

# USECASE: Working with factors in a tibble

cereal <- read_csv("cereal.csv")
cereal

cereal2 <- mutate(cereal, mfr = factor(mfr), type = factor(type))
cereal2

table(cereal2$mfr)

table(cereal2$type)

Missing Values

  • Sometimes your data will have missing values
    • Perhaps these were never collected
    • Perhaps the values were lost/corrupted
    • Perhaps the participant didn’t respond
  • We need to tell R which values are missing
    • To do so, we set those values to NA
    • Functions from tidyverse make this easy
  • Missingness is often “contagious” in R
    e.g., a vector with NA has an unknown mean

Missing Values Live Coding

# SETUP: We will need tidyverse for the read and mutate functions

library(tidyverse)

# ==============================================================================

# PITFALL: Number codes for missingness will mess up calculations in R

heights <- c(149, 158, -999) # here we use -999 to represent a missing value

range(heights)

mean(heights)

log(heights) # our missing value is no longer -999

# ==============================================================================

# USECASE: Use NA for missingness instead

heights2 <- c(149, 158, NA)
heights2

log(heights2) # the NA stayed an NA (due to contagiousness)

# ==============================================================================

# LESSON: Use na.rm = TRUE to do a summary function ignoring the NAs

mean(heights2) # the mean is an NA (due to contagiousness)

mean(heights2, na.rm = TRUE)

range(heights2, na.rm = TRUE)

# ==============================================================================

# USECASE: Dealing with missing values in tibbles

cereal <- read_csv("cereal.csv")

cereal$rating

range(cereal$rating)

# ==============================================================================

# LESSON: Use na_if() to convert specific values to NA while mutating

cereal2 <- mutate(cereal, rating = na_if(rating, -999))

cereal2$rating

range(cereal2$rating, na.rm = TRUE)

# ==============================================================================

# LESSON: Use read_csv(na) to convert specific values to NA while reading

cereal3 <- read_csv("cereal.csv", na = "-999")

cereal3$rating

range(cereal3$rating, na.rm = TRUE)

Wrangle IV

Summarize

  • Although we store data about many observations…
  • …we often want to summarize across observations
    • This is like folding the tibble down to one row
  • We’ve seen functions that summarize vectors
    • length(), sum(), min(), max()
    • mean(), median(), sd(), var()
  • summarize() lets us use them on tibbles
    • It works very similarly to mutate()
    • It always creates a tibble as output

Summarize Live Coding

# SETUP: We will need tidyverse and an example dataset

library(tidyverse)

sales <- 
  tibble(
    customer = c(1, 2, 3, 1, 3),
    store = c("A", "A", "A", "B", "B"),
    items = c(25, 20, 16, 10, 5),
    spent = c(685, 590, 392, 185, 123)
  ) |> 
  print()

# ==============================================================================

# USECASE: Summarize the typical sales

my_summary <- 
  sales |> 
  summarize(
    avg_items = mean(items),
    avg_spent = mean(spent)
  ) |> 
  print()

# ==============================================================================

# PITFALL: Don't use summary() instead of summarize()

my_summary <- 
  sales |> 
  summary(
    avg_items = mean(items),
    avg_spent = mean(spent)
  ) |> 
  print() # not a tibble

# ==============================================================================

# USECASE: Use more than one summary function

my_summary <- 
  sales |> 
  summarize(
    total_items = sum(items),
    total_spent = sum(spent),
    avg_items = mean(items),
    avg_spent = mean(spent)
  ) |> 
  print()

# ==============================================================================

# USECASE: Use counting functions

my_counts <- 
  sales |> 
  summarize(
    n_sales = n(),
    n_customers = n_distinct(customer),
    n_stores = n_distinct(store)
  ) |> 
  print()

Group Summarize

  • We can also summarize a tibble by group
    • This is like folding the tibble multiple times
    • Specifically, we fold down to one row per group
  • The syntax for summarize is identical
    • The only difference is to the tibble
    • We first pass it through group_by()
    • Pipelines make this very easy

Group Summarize Live Coding

# SETUP: We will need tidyverse and an example dataset

library(tidyverse)

sales <- 
  tibble(
    customer = c(1, 2, 3, 1, 3),
    store = c("A", "A", "A", "B", "B"),
    items = c(25, 20, 16, 10, 5),
    spent = c(685, 590, 392, 185, 123)
  ) |> 
  print()

# ==============================================================================

# LESSON: We pass a tibble through group_by to group it

sales

sales |> group_by(store) # note the printout now includes a "Groups: store [2]" line

# ==============================================================================

# USECASE: We can then summarize and get stats per group

sales |> 
  group_by(store) |> 
  summarize(
    customers = n_distinct(customer),
    items_sold = sum(items),
    total_sales = sum(spent),
    avg_items = mean(items),
    avg_spent = mean(spent)
  )

# ==============================================================================

# SETUP: Let's get a larger, more realistic dataset

# Extras pane > Packages tab > Install > nycflights13

library("nycflights13")

flights

# ==============================================================================

# USECASE: Find the carrier with the lowest average delays

flights |> 
  group_by(carrier) |> 
  summarize(m_delay = mean(dep_delay, na.rm = TRUE)) |> 
  arrange(m_delay)

# ==============================================================================

# LESSON: We can also group by multiple variables

# USECASE: Let's find the day of the year with the most flights

flights |> 
  group_by(month, day) |> 
  summarize(n_flights = n()) |> 
  arrange(desc(n_flights))
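
When grouping by several variables, summarize() by default peels off only the last grouping variable, so a result like the one above stays grouped by the first variable (dplyr prints a message about this). A small sketch on the sales tibble from this section, with illustrative object names, shows the `.groups = "drop"` argument returning an ungrouped tibble:

```r
library(dplyr)
library(tibble)

sales <- tibble(
  customer = c(1, 2, 3, 1, 3),
  store = c("A", "A", "A", "B", "B"),
  items = c(25, 20, 16, 10, 5),
  spent = c(685, 590, 392, 185, 123)
)

# Default: the result is still grouped by the first variable (store)
still_grouped <- sales |>
  group_by(store, customer) |>
  summarize(total_spent = sum(spent))
group_vars(still_grouped)

# .groups = "drop" removes all grouping from the result
ungrouped <- sales |>
  group_by(store, customer) |>
  summarize(total_spent = sum(spent), .groups = "drop")
group_vars(ungrouped)
```

Dropping the groups is a cautious default when later steps should treat the result as one plain table.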

Visualize I

What is a graphic?

A data visualization expresses data through visual aesthetics.

Describing Graphics

Some simple graphics are easy to describe and may even have ready names.

Describing Graphics

A grammar of graphics will help us describe more complex graphics.

The Grammar of Graphics

  • The grammar of graphics is a set of rules for describing and creating data visualizations
  • To make our data visual (and therefore put our highly evolved occipital lobes to work)…
    • We connect variables to visual qualities
    • We represent observations as visual objects
  • This requires some fundamental elements
    • We will first learn about them in lecture
    • We will then apply them in R using {ggplot2}

Data

# A tibble: 234 × 11
   manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
   <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
 1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
 2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
 3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
 4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
 5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
 6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
 7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
 8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
 9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
# ℹ 224 more rows

Graphics require data (e.g., tibbles), which describe observations using variables.

Aesthetic Mappings

Graphics require aesthetic mappings, which connect data variables to visual qualities.

Scales

Graphics require scales, which connect specific data values to specific aesthetic values.
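
To make the idea of "specific data values to specific aesthetic values" concrete, here is a small sketch using scale_color_manual(). The color choices and the name `drv_colors` are my own assumptions, picked only for illustration:

```r
library(ggplot2)

# Hypothetical color choices: each drivetrain value gets one specific color
drv_colors <- c("4" = "darkgreen", "f" = "steelblue", "r" = "firebrick")

p_scale <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  scale_color_manual(values = drv_colors) # the scale does the value-to-color lookup

p_scale
```

The mapping says "color depends on drv"; the scale says which color each value of drv actually receives.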

Geometric Objects

Graphics require geometric objects (geoms), which represent the observations.

ggplot2 Basics

  • The ggplot2 package is a part of tidyverse
    • No need to install or load it separately
    • It plays nicely with tibbles and wrangling
  • It implements the grammar of graphics in R
    • The “gg” stands for “grammar of graphics”
    • Thus, we will need to provide all four elements
  • We will create a pseudo-pipeline of commands
    • However, we will use + rather than |>
    • This is because {ggplot2} predates the R pipe
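
The two operators can appear in the same chain. The sketch below (the filter on `year` and the name `p_pipe` are just illustrative choices) wrangles with |> and then switches to + once ggplot() has been called:

```r
library(dplyr)
library(ggplot2)

# Wrangling steps are chained with |> ...
p_pipe <- mpg |>
  filter(year == 2008) |>
  # ... and the result lands in ggplot()'s first argument (data)
  ggplot(aes(x = displ, y = hwy)) +
  # From here on, plot components are added with +
  geom_point()

p_pipe
```

A common slip is to keep typing |> after ggplot(); if you get an error there, check the operator first.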

ggplot2 Live Coding

# SETUP: We will need tidyverse and an example dataset

library(tidyverse)

mpg

# ==============================================================================

# LESSON: First, set the data to a tibble

p <- ggplot(data = mpg)
p

# ==============================================================================

# LESSON: Next, set the aesthetic mappings with aes()

p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy))
p

# ==============================================================================

# TIP: You can leave off the optional argument names

p <- ggplot(mpg, aes(x = displ, y = hwy))
p

# ==============================================================================

# LESSON: Next, set the positional scales

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  scale_x_continuous(
    name = "Engine Size (in liters)", 
    limits = c(1, 7), 
    breaks = 1:7
  ) +
  scale_y_continuous(
    name = "Highway Fuel Efficiency (in miles/gallon)",
    limits = c(10, 50),
    breaks = c(10, 20, 30, 40, 50)
  )
p

# ==============================================================================

# LESSON: Finally, add a point geom

p <- 
  ggplot(mpg, aes(x = displ, y = hwy)) + 
  scale_x_continuous(
    name = "Engine Size (in liters)", 
    limits = c(1, 7), 
    breaks = 1:7
  ) +
  scale_y_continuous(
    name = "Highway Fuel Efficiency (in miles/gallon)",
    limits = c(10, 50),
    breaks = c(10, 20, 30, 40, 50)
  ) +
  geom_point()
p

# ==============================================================================

# TIP: If you leave off the scales, ggplot2 will pick default scales for you

p <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()
p

# ==============================================================================

# LESSON: We can also customize the geom with arguments

p <- ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point(color = "red", shape = "square", size = 2)
p

Basic Layering

  • ggplot2 uses a layered grammar of graphics
    • We can keep stacking geoms on top
  • Layering adds a lot of possibilities
    • We can convey more complex ideas
    • We can learn more about our data
  • But we can still describe these graphics
    • Just describe each layer in turn
    • And describe the layers’ ordering

Basic Layering Live Coding

# SETUP: We will need tidyverse and an example dataset

library(tidyverse)

mpg

# ==============================================================================

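# USECASE: Add a smooth geom (i.e., a fitted trend line)

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth() # default: a flexible curve (loess for small datasets like this one)

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm") # a straight line of best fit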

# ==============================================================================

# USECASE: Add a line geom (i.e., connecting points)

economics

ggplot(economics, aes(x = date, y = unemploy)) + 
  geom_point()

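# NOTE: For lines, linewidth replaces size as of ggplot2 3.4.0

ggplot(economics, aes(x = date, y = unemploy)) +
  geom_point() +
  geom_line(color = "orange", linewidth = 1) # line drawn on top of the points

ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line(color = "orange", linewidth = 1) + # line drawn underneath the points
  geom_point()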

# ==============================================================================

# USECASE: Add reference line geoms

ggplot(economics, aes(x = date, y = unemploy)) +
  geom_hline(yintercept = 0, color = "orange", linewidth = 1) +
  geom_line(color = "blue", linewidth = 1) +
  geom_point()

# On a date axis, give xintercept as a Date (this particular date is illustrative)
ggplot(economics, aes(x = date, y = unemploy)) +
  geom_vline(xintercept = as.Date("2000-01-01"), color = "orange", linewidth = 1) +
  geom_line(color = "blue", linewidth = 1) +
  geom_point()

# On a date axis, geom_abline() counts x in days since 1970-01-01
ggplot(economics, aes(x = date, y = unemploy)) +
  geom_abline(intercept = 4000, slope = 0.5, color = "orange", linewidth = 1) +
  geom_line(color = "blue", linewidth = 1) +
  geom_point()
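
If you want to check a plot's layer ordering without rendering it, one option is to look at the plot object itself. This is a rough sketch; `p_layers` is a hypothetical name, and it relies on the `layers` element that ggplot objects carry:

```r
library(ggplot2)

p_layers <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line(color = "orange") + # added first, so drawn at the bottom
  geom_point()                  # added second, so drawn on top

# One entry per layer, stored in drawing order
length(p_layers$layers)

# The geom behind each layer, in the same order
sapply(p_layers$layers, function(layer) class(layer$geom)[1])
```

Reading the layers from first to last is the same as describing the graphic from bottom to top.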

Working with Color

  • Color scales come in two main types:
    • Discrete scales have separate colors
      • Best with factor variables
    • Continuous scales form a gradient
      • Best with numeric variables
  • There are two ways to control color:
    • You can map color to a variable
      • It will take on different values
    • You can set color to a value
      • It will take on one value only

Color Live Coding

# SETUP: We will need tidyverse and an example dataset

library(tidyverse)

mpg

# ==============================================================================

# USECASE: Continuous color scales work well with numeric variables

ggplot(mpg, aes(x = hwy, y = cty, color = displ)) +
  geom_point(size = 4)

ggplot(mpg, aes(x = hwy, y = cty, color = displ)) +
  geom_point(size = 4) +
  scale_color_continuous(type = "viridis")

# ==============================================================================

# USECASE: Use a discrete color scale with categorical variables

ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point()

ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  scale_color_discrete(
    name = "Drivetrain", 
    breaks = c("4", "f", "r"), 
    labels = c("Four Wheel", "Front Wheel", "Rear Wheel")
  )

# ==============================================================================

# PITFALL: Don't forget to set categorical variables as factors

ggplot(mpg, aes(x = displ, y = hwy, color = cyl)) + 
  geom_point() # R guesses you want a continuous scale

ggplot(mpg, aes(x = displ, y = hwy, color = factor(cyl))) + 
  geom_point() + 
  scale_color_discrete(name = "Cylinders")

# ==============================================================================

# LESSON: Set a geom's color aesthetic to make it always that color

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "red")

# ==============================================================================

# PITFALL: However, do this inside the geom function, not inside aes()

ggplot(mpg, aes(x = displ, y = hwy, color = "blue")) +
  geom_point() # unintended: "blue" is treated as data, so you get a default color and a legend

# ==============================================================================

# LESSON: If you both set and map color, the setting will win

ggplot(mpg, aes(x = displ, y = hwy, color = drv)) + 
  geom_point(color = "blue") 
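
One way to confirm which colors a layer ends up with is layer_data(), which returns the computed aesthetics for a layer. A small sketch, with `p_mapped` and `p_both` as hypothetical names:

```r
library(ggplot2)

# Color is mapped to a variable only
p_mapped <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point()

# Color is mapped to a variable AND set to a value in the geom
p_both <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point(color = "blue")

# layer_data() looks at the first layer by default
unique(layer_data(p_mapped)$colour) # one color per drivetrain
unique(layer_data(p_both)$colour)   # only the color that was set
```

Note that the column is spelled `colour` in the returned data, whichever spelling you used in your own code.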

Themes

  • Themes control how non-data elements look
    • e.g., how thick to draw the gridlines
    • e.g., where to position the legend
  • Complete themes change many elements at once
    • Some are built into ggplot2
    • Others come in R packages
    • {papaja} provides theme_apa()
  • Individual elements can be customized too

Themes Live Coding

# SETUP: We will need tidyverse and an example graphic

library(tidyverse)

p <- 
  ggplot(mpg, aes(x = displ, y = hwy, color = drv)) + 
  geom_point() +
  labs(title = "Fuel Efficiency")
p

# ==============================================================================

# USECASE: Apply a "complete" theme

p + theme_bw()

p + theme_classic()

p + theme_dark()

# ==============================================================================

# LESSON: For more precise control, we can use theme()

p + theme(legend.position = "top")

p + theme(plot.title = element_text(color = "purple", face = "bold"))

p + theme(panel.grid = element_blank())

# NOTE: There are a lot of elements to learn, so use a cheatsheet!
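
When combining a complete theme with theme() tweaks, the order matters, because a complete theme replaces every element when it is added. A brief sketch (the object names are hypothetical):

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point()

# Complete theme first, then the tweak: the legend moves to the top
p_tweak_last <- p + theme_bw() + theme(legend.position = "top")
p_tweak_last

# Tweak first, then the complete theme: theme_bw() resets the legend position
p_theme_last <- p + theme(legend.position = "top") + theme_bw()
p_theme_last
```

A safe habit is to add the complete theme first and put any theme() adjustments after it.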

Exporting Graphics

  • We may need to export graphics from R
    • e.g., for a paper, poster, or presentation
  • This job is handled fantastically by ggsave()
    • We can create many types of files
    • We can customize the exact size
  • I recommend .png for most daily purposes
    • For publishing, I prefer .pdf or .svg
    • As vector formats, they stay sharp at any zoom level
    • You can send these files to most publishers

Exporting Live Coding

# SETUP: We will need tidyverse and an example graphic

library(tidyverse)

p <- ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point() + geom_smooth() +
  labs(x = "Engine Displacement", y = "Highway MPG")
p

# ==============================================================================

# USECASE: Save a specific ggplot object to a file

ggsave(filename = "pfinal.png", plot = p)

# ==============================================================================

# LESSON: Specify the size of the image to create

ggsave(filename = "pfinal2.png", plot = p, 
       width = 6, height = 3, units = "in")

# ==============================================================================

# LESSON: Just change the extension to create a different file type

ggsave(filename = "pfinal2.pdf", plot = p, 
       width = 6, height = 3, units = "in")

# ==============================================================================

# PITFALL: Creating a very large image may lead to small text

ggsave(filename = "p_poster.png", plot = p, 
       width = 12, height = 8, units = "in")

# ==============================================================================

# TIP: You can quickly increase the text size using base_size

p2 <- p + theme_grey(base_size = 24)

ggsave(filename = "p_poster2.png", plot = p2,
       width = 12, height = 8, units = "in")
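
For raster files such as .png, ggsave() also takes a dpi argument (300 by default) that controls how many pixels each inch becomes. The sketch below writes to a temporary file so that it can be tried without cluttering the working directory; `out_file` is a hypothetical name:

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# tempfile() gives a throwaway path, so nothing lands in the working directory
out_file <- tempfile(fileext = ".png")

# dpi sets pixels per inch: 6 in x 3 in at 300 dpi is an 1800 x 900 pixel image
ggsave(filename = out_file, plot = p,
       width = 6, height = 3, units = "in", dpi = 300)

# Confirm that the file was written
file.exists(out_file)
```

Raising dpi makes a sharper (and heavier) file without changing how large the text looks relative to the plot, which is why it pairs well with a fixed width and height.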